require(tidyverse)
require(DiagrammeR)
require(poppr)
require(genepop)
require(graph4lg)
require(related)
require(adegenet)
require(knitr)
This is document is an R notebook. If you’d like view to pre-rendered figures, read a summary of analysis and interact with code, please open the relevant html file in a browser.
To conduct a similar analyses on your computer, edit or run code: clone this repository into a directory on you r local machine and open the .Rproj file in Rstudio.
Genotyping-in-Thousands by sequencing (GT-seq) is a cost effective method developed by Campbell et al (2015) to genotype up to thousands of samples at a small (~500) set of markers using the illumina platform.
This document:
(a) outlines the bioinformatic procedures for processing raw GT-seq reads to a fully filtered SNP dataset according to the standards adopted by the State Fisheries Genomics Lab at Oregon State University
(b) shares instructions for setting up the environment to run the scripts
(c) provides an example detailed protocol genotyping Oncorhynchus mykiss samples with figures and results
(d) directs GT-seq users to relevant metadata to conduct their own genotyping (i.e probe sequences)
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
tab6 [label = '@@6']
tab7 [label = '@@7']
tab8 [label = '@@8']
# edge definitions with the node IDs
tab1 -> tab2 [label = 'GTseq_BarcodeSplit']
tab2 -> tab3 [label = 'GTseq_Genotyper']
tab3 -> tab4 [label = 'GTseq_genocompile']
tab4 -> tab5 [label = 'filtering']
tab4 -> tab6 [label = 'GTseq_summaryfigs']
tab3 -> tab7 [label = 'GTseq_genocompilecounts']
tab7 -> tab5
tab1 -> tab8 [label = 'GTseq_seqtest']
tab7 -> tab4 [label = 'GTseq_Omysex']
tab6 -> tab5
}
[1]: 'Raw Reads'
[2]: 'Demultiplezed fastqs'
[3]: 'Individual genotypes'
[4]: 'raw GTseq dataset'
[5]: 'final filtered dataset'
[6]: 'summary figures'
[7]: 'read counts'
[8]: 'marker info'
")
Full Gt-seq genotyping flowchart: nodes are data, edges are scripts
grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
# edge definitions with the node IDs
tab1 -> tab2 [label = 'GTseq_BarcodeSplit']
tab2 -> tab3 [label = 'GTseq_Genotyper']
tab3 -> tab4 [label = 'GTseq_genocompile']
tab4 -> tab5 [label = 'filtering']
}
[1]: 'Raw Reads'
[2]: 'Demultiplezed fastqs'
[3]: 'Individual genotypes'
[4]: 'raw GTseq dataset'
[5]: 'final filtered dataset'
")
Typical Gt-seq genotyping flowchart when using established panel: nodes are data, edges are scripts
The GTseq pipeline consists of a handful of perl and python scripts to process raw sequencing reads to a fully filtered SNP dataset. Here’s a quick reference to each of the steps in a typical pipeline.
Sequencing QC
- Check that the reads look good from the sequencer with fastqc or similar program
GT-seq Scripts
- Demultiplex reads: The first step is to demultiplex sequencing data if demuxed files are not provided by sequencing center using the GTseq_Barcode_Split_MP.py script or a different demultiplexing approach such as deML
- Call genotypes: Use GTseq_Genotyper_v3.1.pl to call genotypes on demultiplexed reads.
- (Optional) Call Sex Genotypes: If species has unique script for calling the sex markers (O. mykiss and O. tshawytscha), run these as well
- Compile: After all the individual genotypes are called, compile them into a single output using the GTseq_GenoCompile_v3.pl script.
QAQC Check:
- Check positive and negative controls (for plate flipping, other library prep errors)
- Check known genotype controls if included (het winter summer controls)
- Check technical replicates
- Remove controls and replicates if all looks good
Filtering:
The final filtering cutoffs are below:
This SOP is run in two environments, the OSU cluster and a local instance of R. You can open this document as a .rmd in Rstudio when processing your own data, or simply copy and paste the code found here into your own terminal.
A few optional code chunks also use the terminal on a unix based machine (i.e. a mac), but they can also run entirely on the cluster, or on a windows based machine with a little work (just run the unix based commands in a unix/bash based terminal on a windows machine like cygwin or the new bash shell available on windows 10 install link here ).
Commands sent to the terminal appear as code chunks in this SOP and narrative is provided in the main body. Click the gray “code” box to see the commands and more details about a step. The appropriate environment for running each code chunk will appear as a comment at the head the chunk. There are also full scripts submitted as jobs to the server. These need to be saved as a file and submitted with the appropriate command from the server scheduler such as qsub or SGE_Batch.
Copying the directory in this repository named “example data” will provide all the relevant files to run the local portion (R) of the example protocol. See the code chunk below on how to do this.
# Local
# note: this is optional and only needed if you want to follow along with the example code on your own machine
# move to your working directory
# create a directory for this repository and move inside
mkdir GT-seq
cd GT-seq
#copy the whole repository using git clone
git clone https://github.com/State-Fisheries-Genomics-Lab/GT-seq.git
# after you've cloned the repository, open the file GT-seq.Rproj in Rstudio for all features
Finally, paths to files and scripts in the SOP are on David Dayan’s directories, you will need to modify the scripts to reflect your own paths.
Scripts
In order to run the GTseq, you must have a copy of the most current GTseq scripts stored on the cluster. As there are multiple versions of the scripts, it is reccomended to download them into directory on the cluster directly from the github repository, which is actively maintained.
Option A: File transfer GUI
Move all the files from the script directory to your own directory on the server using filezilla or whatever file transfer software you like.
Option B: Git
You can also clone whole repository onto the server with git and then move relevant files to where you want them
# SERVER
# clone the repository somewhere temporary
mkdir ~/tmp
cd ~/tmp
git clone https://github.com/State-Fisheries-Genomics-Lab/GT-seq
# choose home directory for your software
# make directory for GTseq scripts and move files inside
mkdir /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/
mv ~/tmp/GT-seq/GT-seq_scripts/* /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/
#delete your temporary repo
bash #default shell is zcsh which requires you to confirm EVERY SINGLE file deletion
rm -r ~/tmp/* #caution scary, be sure there's no typo here
rmdir ~/tmp
Perl and Python Libraries
The GT-seq pipeline uses both perl and python. All of the relevant libraries should be installed on the server except StringApprox. You will need to install StringApprox and configure your server environment so it is included it in your path.
Installing:
# SERVER
# You have three options to install String::Approx, I confirmed the first one works, but the others may be even simpler
#### Option 1: make
# first download the tarball for String Approx here ( https://metacpan.org/release/String-Approx ) into a temporary directory on the server
# then, unpack the tarball and follow the instructions in the readme (below)
perl Makefile.PL
make
make test
make install
#### Option 2: Conda
conda install -c bioconda perl-string-approx
#### Option 3: cpan
perl -MCPAN -e shell
install String::Approx
Setting the environment:
Before running any scripts that require the String::Approx module, you must set the environment with the following commands. This will be noted in the example scripts but can also be found here.
#SERVER
setenv PERL5LIB ~/perl5/lib/perl5/x86_64-linux-thread-multi/ #for tcsh (default shell on server)
export PERL5LIB='/home/fw/dayand/perl5/lib/perl5/x86_64-linux-thread-multi/' # if using bash as your shell
This repository contains all the files needed to run the pipeline on new data and additional information about the panel markers. See readmes in each subdirectory for up to date contents.
Probe_Sequences
The probe sequences used by the genotyper script to call SNPs at the panel markers. Link here
Panel Info
Information about the panel markers such as amplicon sequences, annotations, sources, primer and probe sequences etc. Link here
Marker Mapping
Results from mapping studies. Link here
Here is an example script of a full genotyping pipeline from raw reads to fully filtered SNP dataset.
The example dataset uses fall run and half-pounder steelhead from the Rogue River.
Sample Summary
Let’s gather all the metadata and summarize:
# LOCAL R
#intake files
half_2019_intake <- readxl::read_xlsx("example_data/metadata/STHP Intake form Spread sheet 2019.xlsx", sheet = 1)
fall_intake <- readxl::read_xlsx("example_data/metadata/Rogue Adult Summer and Winter (06 11 20).xlsx", sheet = 1)
#merge intakes
# first clean them up a bit to make merging easier
half_2019_intake <- half_2019_intake[,c(1,2)]
half_2019_intake$run <- "halfpounder"
half_2019_intake$year <- "2019"
colnames(half_2019_intake) <- c("ID", "Date", "run", "year")
fall_intake <- subset(fall_intake, Run == "Summer")
fall_intake <- fall_intake[,c(1,3,6)]
colnames(fall_intake) <- c("ID", "Date", "run")
fall_intake$run <- "fall"
fall_intake$year <- "2019"
meta_data <- bind_rows( half_2019_intake, fall_intake)
kable(meta_data %>%
group_by(run, year) %>%
tally())
| run | year | n |
|---|---|---|
| fall | 2019 | 166 |
| halfpounder | 2019 | 331 |
Sequencing Data Summary
Data came demultiplexed from sequencing center. Location in code chunk below
# SERVER
/dfs/FW_HMSC/Omalley_Lab/dayan/half_pounder/genotyping/demux - 2019 fall and 2019 halfpounder
Clusters: 409,149,535
Yield (mbase): 61,782
% >Q30 bases: 87.13
Average Qual: 37.18
Ran fastqc report (not included in SOP)
The first step is to demultiplex the raw sequencing file using the i5 and i7 indexes. We can actually skip this because the sequencing center already performed a demux. Instead, we just copy the demuxed files and give them more reasonable name.
# SERVER
#first move the demuxed files over
cp /nfs2/hts/illumina/200723_J00107_0245_AHHLG2BBXY_1504/L23/*Omy* /dfs/FW_HMSC/Omalley_Lab/dayan/half_pounder/genotyping/demux
cp /nfs2/hts/illumina/200723_J00107_0245_AHHLG2BBXY_1504/L23/*positive* /dfs/FW_HMSC/Omalley_Lab/dayan/half_pounder/genotyping/demux
cp /nfs2/hts/illumina/200723_J00107_0245_AHHLG2BBXY_1504/L23/negative* /dfs/FW_HMSC/Omalley_Lab/dayan/half_pounder/genotyping/demux
#next rename to something easier to work with, this command takes the last 20 characters and tosses the rest
for file in ./*
do
mv "$file" "${file:20}"
done
If, instead you are supplied with multiplexed fastq files (boo) you can use the GTseq scripts to demultiplex. First you need to generate a key file.
# LOCAL R
# example: Sample,PlateID,i7_name,i7_sequence,i5_name,i5_sequence
# Sample123,P1234,i001,ACCGTA,25,CCCTAA
# Sample321,P1234,i001,ACCGTA,26,GGCACA
#first lets get the index sequences
index2020 <- read_tsv("example_data/metadata/index_2020.txt")
colnames(index2020) <- c("Sample","PlateID","i7_name","i7_sequence","i5_name","i5_sequence")
write_csv(index2020, "example_data/metadata/index_2020_lane.csv")
Then you run the script.
# SERVER
#note this is from Sandra's protocol so the directories are different, but still same approach
#Set directory
usedir='/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/OmyRogue'
#First argument is list of barcodes and second argument is .fastq file.
#Make sure you dos2unix the barcode file if you made it in Excel or Windows!
mkdir $usedir'/sample_fastqs'
cd $usedir'/sample_fastqs'
#here is the command you will submit
python '/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/pipeline/GTseq_BarcodeSplit_MPargs.py' $usedir'/OmyROGR_index-list.csv' '/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/RAW_READS/OmyROGROtsFERHCLAR.fq'
#here is how to wrap it correctly for submission to the server queue
SGE_Batch -q harold -r omysex -c "python '/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/pipeline/GTseq_BarcodeSplit_MPargs.py' $usedir'/OmyROGR_index-list.csv' '/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/RAW_READS/OmyROGROtsFERHCLAR.fq'"
Main Genotyper
Next we’ll run the GTseq genotyper (v3.1) script on each fastq file to generate the individual genotypes (.genos) . Note that there are many copies of the genotyper script. 3.1 is the current version (assymetrical genotype caller bug fixed)
We need to run this script for each demultiplexed fastq. To speed this up we’ll use an array job. The code chunk below contains an example array job script. To run the job, you’ll save the script on the server and run it using the qsub command.
The genotyper script requires unzipped fastq files for input. First code chunk below can decompress the input if your demultiplexed files are compressed
##### decompress script
# note 1: this is a script to save and submit as a job, save everything below the long ########### below
#note 2: the number of threads to use (-t option) is hardcoded to match the number of input files, change this number to reflect how many fastq.gz files you have
#################################
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-554
#$ -tc 20
#$ -N decompress
#$ -cwd
#$ -o $JOB_NAME_$TASK_ID.out
#$ -e $JOB_NAME_$TASK_ID.err
FASTQS=(`ls *fastq.gz`)
INFILE=${FASTQS[$SGE_TASK_ID -1]}
gunzip -c $INFILE > ${INFILE%.gz}
#save as script and submit this with qsub -q harold scriptname
####################################
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-554
#$ -tc 20
#$ -N GTseq-genotyperv3
#$ -cwd
#$ -o $JOB_NAME_$TASK_ID.out
#$ -e $JOB_NAME_$TASK_ID.err
export PERL5LIB='/home/fw/dayand/perl5/lib/perl5/x86_64-linux-thread-multi/'
FASTQS=(`ls /dfs/FW_HMSC/Omalley_Lab/dayan/half_pounder/genotyping/demux/*fastq`)
INFILE=${FASTQS[$SGE_TASK_ID -1]}
OUTFILE=$(basename ${INFILE%.fastq}.genos)
GTSEQ_GENO="/dfs/FW_HMSC/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_Genotyper_v3.1.pl
"
PROBE_SEQS="/dfs/FW_HMSC/Omalley_Lab/dayan/software/GTseq-Pipeline/Omy_GTseq390_ProbeSeqs.csv"
perl $GTSEQ_GENO $PROBE_SEQS $INFILE > $OUTFILE
#save this code chunk as a file on the server and submit this with qsub -q harold scriptname from the directory you want the output .genos files
Some notes about this script:
-t : this is the total number of task id to use. this script will iterate over this list. so if you have 100 individuals to genotype set this to 1-100.
-tc: this is the throttle. the value used in the script above (20), means we will submit the 554 separate commands in batches of 20 until it is done
submitting: I usually store all of the scripts I will submit to the server in a directory named “sge,” so if I wanted to submit this script (named gtseq_geno.txt) i’d submit the command qsub -q harold path/to/my/scripts/sge/gtseq_geno.txt . you can check the progress with the qstat command.
Sex Genotyper
After the genotypes are written for the panel, we add the sex genotyper.
# SERVER
# notes
# the omysex script (and otssex) is hardcoded to require the fastqs and genos to all be in a single collective directory, for fastqs to be decompressed, and for it to be run from this directory
# rather than try to fix this hardcoding in the script, a shortcut is to copy all the fastqs into the directory with your .genos from the previous step then deleted them after to conserve space
# another hardcoded part of the omysex script involves how it parses sample names all fastqs must be samplename.fastq and all genos must be samplename.genos
##### copy decompressed fastqs to directory with your .genos
cp /dfs/FW_HMSC/Omalley_Lab/dayan/half_pounder/genotyping/demux/*fastq ./
##### rename
#note: in previous SOP version an error caused output genos files to have the extension "fastq.genos", this fixes that error, but should be deprecated in the current version
# leaving it here in case it crops up again
for file in ./*fastq.genos
do
mv "$file" "${file%.fastq.genos}".genos
done
##### sex genotyper
#below is the command to run the omysex script
SGE_Batch -q harold -r omysex -c 'perl /dfs/FW_HMSC/Omalley_Lab/dayan/software/GTseq-Pipeline/OmySEX_test_v3.pl'
##### don't forget to remove these dups when done
Compile Genotypes
After all the .genos are written, we compile them into a single output using the GenoCompile script
#this is run from within the .genos directory
SGE_Batch -q harold -r compile -c 'perl /dfs/FW_HMSC/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_GenoCompile_v3.pl > ../genotypes/half_pounder_GTs_0.1.csv'
Congratulations, you’ve run the pipeline! Now we will need to do quality control and filtering.
At this point in the suggested protocol, you should run the SummaryFigures script from the GTseq pipeline. However, this script contains harcoded values for genotype calling that are different from the actual genotyping script and will misrepresent the data. Do not run it. We will produce the same figures in the next section.
Here we do some basic quality control. We check positive (known good DNA) and negative (known blank samples) controls, as well as control for known genotypes (i.e. known winter run, etc). After controls, we check for concordance among sample replicates to check that no wetlab errors occured that might have scrambled indexing/barcoding.
First let’s check that the controls worked well. We will check that negative controls have much fewer reads than average (there may be some on-target reads from othr samples due to index sequence error)
# LOCAL R
# first I cleaned up the sample names in the output compiled genotype file (.csv) with regex in a text editor - separated adapter info from sample name etc, you may not need to do this depending on who did your demultiplexing
# then read this file in to R
genos_0.1 <- read_csv("example_data/genotype_data/example_GTs_0.1.csv")
# lets set a value to mark controls
# here controls contained "positive," "negative" in their sample names so used simple pattern matching to create a new column
genos_0.1 <- genos_0.1 %>%
mutate(control = ifelse(grepl("positive", Sample), "positive", ifelse(grepl("negative", Sample), "negative", "sample")))
# great let's plot
ggplot()+geom_histogram(data = genos_0.1, aes(x = `On-Target Reads`, fill= control)) + theme_classic()
Looks good. Negative controls have very few reads, positive are distributed, but lets just double check that there isn’t a negative control with a lot of reads hiding in there and indicating a plate flip:
ggplot()+geom_histogram(data = genos_0.1[genos_0.1$control=="negative",], aes(x = `On-Target Reads`)) + theme_classic()
Uh-oh, a negative control with ~6300 on target reads, but maybe there’s just a lot of reads at this adapter/well, lets examine as a portion of total reads, and also the portion GT’d.
ggplot()+geom_histogram(data = genos_0.1, aes(x = `%On-Target`, fill= control)) + theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Okay, that looks a lot better. The negative control with a lot of reads only has very small amount on target, much lower than most samples and positive controls. This suggests it is unlikely to be due to a wetlab error like flipping a plate upside down. Positive and negative controls check out. Also of note here is that the positive controls are biased down with respect to on target proportion, suggesting these samples are getting old.
Some samples were replicated, let’s check for concordance in the genotypes, the pick the sample with better GT success and throw out the duplicate.
#LOCAL R
# here we filter out our known controls and create our next dataset genos_0.11
genos_0.11 <- genos_0.1 %>%
filter(control == "sample") %>%
filter(Sample != "Winter") %>%
filter(Sample != "Heterozygous") %>%
filter(Sample != "Summer")
#now let's get duplicated samples
dups <- genos_0.11[genos_0.11$Sample %in% genos_0.11$Sample[duplicated(genos_0.11$Sample)],]
dups <- dups[order(dups$Sample),]
# next we'll calculate the percent concordance among replicates
# woof I don't see a good way around using a nested for loop here, maybe fix this in the future
dups_genos <- dups[,8:ncol(dups)] #caution, possible hardcoding here, grab genotpyes and leave metadata out
rep_info <- matrix(ncol=ncol(dups_genos), nrow=nrow(dups_genos)/2)
colnames(rep_info) <- colnames(dups_genos)
for (j in 1:(nrow(dups_genos)/2)) {
for (i in 1:ncol(dups_genos)) {
rep_info[j,i] <- sum(dups_genos[(j*2)-1,i]==dups_genos[(j*2),i])
}
}
geno_concordance <- as.data.frame(as.matrix(rep_info)) %>%
rowMeans()
rep_data <- as.data.frame(cbind(dups[c(1:length(geno_concordance))*2,2], geno_concordance))
ggplot(data=rep_data)+geom_histogram(aes(x=geno_concordance))+theme_classic()
There are ~50 replicated samples. Mean concordance of genotype across replicates is 87.3%. Next let’s examine whats going on with the samples with very low concordance.
#get the bad samples
bad_reps <- genos_0.11[genos_0.11$Sample %in% rep_data[rep_data$geno_concordance<0.50,1],1:7]
bad_reps[order(bad_reps$Sample),]
One of the pair just has extremely low %On-Target reads
Replication Summary
Replicate samples look good, bad replicates (<50% concordance) seems to be due to one replicate of the pair having extremely low GT success.
Next let’s make the 0.2 dataset (i.e. remove the replicates with lower GT success).
# LOCAL R
#this writes a new dataset (0.2) by choosing the samples within duplicates and keeping the one with the highest genotyping success
genos_0.2 <- genos_0.11 %>%
group_by(Sample) %>%
filter(`On-Target Reads` == max(`On-Target Reads`))
Control and replicates have been removed, now it’s time for filtering.
Filtering Summary
We take an iterative approach to filtering:
First remove worst individuals and genotypes: - GTperc_cutoff=30 (indivudals greater than 30% missing data excluded) - Missingness (loci) > 50% (loci with total missing data > 50% removed) - IFI_cutoff = 10 (i.e. >10% background reads)
Then recalculate missingness and IFI - IFI_cutoff=2.5
- GTperc_cutoff=90 (inds greater than 10% missing data excluded)
- Missingness (loci) > 20%
Then examine for paralogues among markers with
- Missingness (loci) > 10% - examine for allele correction issues
- Markers where heterozygotes and “in-betweeners” do not follow 1:1 ratio of allele counts - Markers with high variance in ratio of allele counts at heteroyzgotes and “in-betweeners” - Remove monomorphic SNPs
File readme
A key to different outputs in the filtering pipeline: XXXXXXXXXXx this needs updating ingore for now XXXXXXXXx
0.1: Raw, unfiltered GTs including all controls and replicates
0.2: Raw, unfiltered GTs, replicates and controls removed
0.3: filtered for IFI, GTperc (inds), and Missingness (loci) > 20%
0.4: filtered for IFI, GTperc (inds), and Missingness (loci) > 20%, and allele correction issues
1.0: 0.4 + removed monomorphic
2.0: 1.0 + duplicate sample removal
2.2: final genotype dataset with metadata (population, run, sample date)
First we filter individuals and loci on IFI, and missingness.
Let’s take a look at the distribution of these values before any filtering
ggplot(genos_0.2)+geom_histogram(aes(x=IFI))+geom_vline(aes(xintercept= 2.5), color="red")+theme_classic()
ggplot(genos_0.2)+geom_histogram(aes(x=`%GT`))+geom_vline(aes(xintercept= 90), color="red")+theme_classic()
missingness <- (colSums(genos_0.2[,c(8:(ncol(genos_0.2)-1))] == "00" | genos_0.2[,c(8:(ncol(genos_0.2)-1))] == "0"))/nrow(genos_0.2) #warning hardcoding: "[,8:398]" is hardcoded to work on the example script using the Omy panel with 390 markers, these values will need to be changed to reflect the genotype columns of the genos r object that YOU are running. This excludes columns with metadata and genotyping results such as "sample name" "ifi" "on-target reads" etc
missing <- as.data.frame(missingness)
missing$marker <- row.names(missing)
ggplot(missing) + geom_histogram(aes(x=missingness))+geom_vline(aes(xintercept= 0.2), color="red")+geom_vline(aes(xintercept= 0.1), color="blue")+theme_classic()+xlab("missingness (loci)")
Now let’s make the datasets. The first step is to collect some information about genotying success from the .genos files. We’ll do this with an awk one liner.
The script below will pull the allele count ratios and read counts for all individuals in the pipeline
# SERVER
#run from directory with your .genos
#collect marker info from all the genos files
touch marker_info.csv
echo 'ind,marker,a1_count,a2_count,called_geno,a1_corr,a2_corr' >> marker_info.csv
for file in ./*genos
do
awk -F"," ' BEGIN { OFS="," } {print FILENAME,$1,$2,$3,$6,$7,$8}' $file >> marker_info.csv
done
# now we'll cleanup this file a little bit
sed -i '/Raw-Reads/d' ./marker_info.csv #first get rid of genos headers
# now clean up the "ind" field so that it matches your data
# depending on the file naming convention you used for your individuals this may need to change, for the example script we removed the leading "./" and the trailing lane information and file extension (e.g. "_L002_R1_001.genos"), but there's no way to standardize this step unless we standardize naming conventions all the way back to the library prep logs
Read in the marker info file and clean it up.
marker_info <- read_csv("example_data/genotype_data/marker_info.csv")
#this part changes the values of A=2, G=898, -=52, etc for the allele count columns to the actual values
marker_info$a1_count <- as.numeric(substr(marker_info$a1_count, 3, nchar(marker_info$a1_count)))
marker_info$a2_count <- as.numeric(substr(marker_info$a2_count, 3, nchar(marker_info$a2_count)))
0.3: Extremely Bad Loci and Individuals Excluded
First remove the individuals and markers that clearly failed to genotype correctly (one step at a time)
#print table of bad missingness individual
kable(genos_0.2 %>%
filter(`%GT` < 70) %>%
select(2:7), caption = "Individuals with high missingess (>30% missing data)")
| Sample | Raw Reads | On-Target Reads | %On-Target | %GT | IFI |
|---|---|---|---|---|---|
| OmyAC19ROGR_4914 | 331506 | 4818 | 1.45 | 38.87 | 2.40 |
| OmyAC19ROGR_5722 | 271387 | 2814 | 1.04 | 23.27 | 2.36 |
| OmyJC19ROGR_0311 | 239390 | 9735 | 4.07 | 67.77 | 1.03 |
# now remove them
genos_0.3 <- genos_0.2 %>%
filter(`%GT` > 70)
#now recalculate locus level missingness after removing the worst individuals
missingness2 <- (colSums(genos_0.3[,c(8:(ncol(genos_0.3)-1))] == "00" | genos_0.3[,c(8:(ncol(genos_0.3)-1))] == "0"))/nrow(genos_0.3) #warning hardcoding: "c(8:(ncol(genos_0.3)-1))" is hardcoded to work on the example script. make sure this this only grabbing the columns that contian genotype data and not other columns (last column should be sample type, first 7 columns should have individual level summary data ) e.g. IFI
missing2 <- as.data.frame(missingness2)
missing2$marker <- row.names(missing2)
#then remove these markers
# collect bad markers
very_bad_markers <- missing2[missing2$missingness2>0.5, 2]
print(paste(length(very_bad_markers), "markers with > 50% missing data"))
## [1] "3 markers with > 50% missing data"
#write the new dataset
genos_0.3 <- genos_0.3 %>%
dplyr::select(-one_of(very_bad_markers))
#then recalculate IFI
# IFI is equal to the percentage of "background" reads to homozygote reads. Two types of reads contribute to background count: (1) Reads from the alternative allele when an individual has been called as homozygote at a locus, and (2) reads from the less frequent allele when the individual has been called as "in-betweener". We update the IFI score by including only markers in the filtered dataset
IFI <- marker_info %>%
filter(marker %in% colnames(genos_0.3)) %>%
group_by(ind) %>%
summarize(back_count = sum(a1_count[called_geno == "A2HOM"], na.rm = TRUE)
+ sum(a2_count[called_geno == "A1HOM"], na.rm = TRUE)
+ sum(a1_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a2_count > a1_count)], na.rm = TRUE )
+ sum(a2_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a1_count > a2_count)], na.rm = TRUE ),
hom_ct = sum(a1_count[called_geno == "A1HOM"], na.rm = TRUE)
+ sum(a2_count[called_geno == "A2HOM"], na.rm = TRUE)
+ sum(a2_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a2_count > a1_count)], na.rm = TRUE )
+ sum(a1_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a1_count > a2_count)], na.rm = TRUE ),
ifi2 = (back_count/hom_ct)*100)
# the "marker_info" file we produced earlier used the filename of the genos file as the sample name (column name "ind"), but the sample names in our local R dataframes are very cleaned up (see line 504). Here I attempt to do the same using some regex in R using the standardized codes for sample naming at SFGL, but note that depending on how your fastq files are named, these exact matches may not work for you
# until we find a better solution I suggest two alternatives if this regex below breaks
# 1: if the number of high IFI samples is very low, just write the sample names out manually to a vector and use this to filter
# 2:
IFI$sample <- str_extract(IFI$ind, "[:upper:][:lower:]{2}[AJCU][RC]\\d{2}\\w{4}_\\d{4}")
IFI$adapter <- str_replace(IFI$ind, "(\\w+)[-_]([:upper:][:lower:]{2}[AJCU][RC]\\d{2}\\w{4}_\\d{4}).*", "\\1")
genos_0.3 <- genos_0.3 %>%
left_join(select(IFI, sample, adapter, ifi2), by = c("Sample" = "sample", "Adapter" = "adapter")) %>%
mutate(IFI = ifi2) %>%
select(-one_of("ifi2"))
# now filter on IFI
#print table of bad IFI samples
kable(genos_0.3 %>%
filter(IFI >10) %>%
select(2:7), caption = "Extreme High IFI (>10) samples (low confidence barcodes)")
| Sample | Raw Reads | On-Target Reads | %On-Target | %GT | IFI |
|---|
#update the dataset
genos_0.3 <- genos_0.3 %>%
filter(IFI < 10)
Filtering log 0.2 -> 0.3:
0 inds removed with IFI > 10
3 inds removed with genotying success less than 70%
3 loci removed with > 50% missingness
0.4 Second Iteration Filter
Next we do the same process, but at the final filtering levels:
#print table of bad missingness individual
kable(genos_0.3 %>%
filter(`%GT` < 90) %>%
select(2:7), caption = "Individuals with high missingess (>10% missing data)")
| Sample | Raw Reads | On-Target Reads | %On-Target | %GT | IFI |
|---|---|---|---|---|---|
| OmyAC19ROGR_0131 | 794216 | 247048 | 31.11 | 88.75 | 2.0472983 |
| OmyAC19ROGR_0138 | 563519 | 232629 | 41.28 | 88.49 | 3.1252756 |
| OmyAC19ROGR_0143 | 499500 | 69489 | 13.91 | 88.49 | 2.8041074 |
| OmyAC19ROGR_4847 | 331766 | 22310 | 6.72 | 86.70 | 0.8472525 |
| OmyAC19ROGR_4947 | 514196 | 13727 | 2.67 | 70.59 | 1.2348285 |
| OmyAC19ROGR_5052 | 335644 | 14778 | 4.40 | 72.89 | 2.9341874 |
| OmyAC19ROGR_5195 | 322639 | 12975 | 4.02 | 70.84 | 1.9589767 |
| OmyAC19ROGR_5507 | 429546 | 21564 | 5.02 | 79.54 | 3.6194794 |
| OmyAC19ROGR_5724 | 365075 | 16410 | 4.49 | 80.05 | 0.9201171 |
| OmyAC19ROGR_5725 | 480551 | 13611 | 2.83 | 79.80 | 1.4192849 |
| OmyJC19ROGR_0069 | 183568 | 22606 | 12.31 | 86.70 | 0.4337986 |
| OmyJC19ROGR_0086 | 198018 | 12917 | 6.52 | 72.12 | 1.4722894 |
| OmyJC19ROGR_0088 | 234849 | 22800 | 9.71 | 85.68 | 2.3414691 |
| OmyJC19ROGR_0140 | 329422 | 24236 | 7.36 | 88.24 | 0.8747722 |
| OmyJC19ROGR_0193 | 299946 | 17996 | 6.00 | 87.47 | 0.6919100 |
# now remove them
genos_0.4 <- genos_0.3 %>%
filter(`%GT` > 90)
#now recalculate locus level missingness after removing the worst individuals
missingness3 <- (colSums(genos_0.4[,c(8:(ncol(genos_0.4)-1))] == "00" | genos_0.4[,c(8:(ncol(genos_0.4)-1))] == "0"))/nrow(genos_0.4) #warning hardcoding: "c(8:(ncol(genos_0.3)-1))" is hardcoded to work on the example script. make sure this this only grabbing the columns that contian genotype data and not other columns (last column should be sample type, first 7 columns should have individual level summary data ) e.g. IFI
missing3 <- as.data.frame(missingness3)
missing3$marker <- row.names(missing3)
#then remove these markers
# collect bad markers
bad_markers <- missing3[missing3$missingness3>0.2, 2]
print(paste(length(bad_markers), "markers with > 20% missing data"))
## [1] "9 markers with > 20% missing data"
#write the new dataset
genos_0.4 <- genos_0.4 %>%
dplyr::select(-one_of(bad_markers))
#then recalculate IFI
# IFI is equal to the percentage of "background" reads to homozygote reads. Two types of reads contribute to background count: (1) Reads from the alternative allele when an individual has been called as homozygote at a locus, and (2) reads from the less frequent allele when the individual has been called as "in-betweener"
IFI <- marker_info %>%
filter(marker %in% colnames(genos_0.4)) %>%
group_by(ind) %>%
summarize(back_count = sum(a1_count[called_geno == "A2HOM"], na.rm = TRUE)
+ sum(a2_count[called_geno == "A1HOM"], na.rm = TRUE)
+ sum(a1_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a2_count > a1_count)], na.rm = TRUE )
+ sum(a2_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a1_count > a2_count)], na.rm = TRUE ),
hom_ct = sum(a1_count[called_geno == "A1HOM"], na.rm = TRUE)
+ sum(a2_count[called_geno == "A2HOM"], na.rm = TRUE)
+ sum(a2_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a2_count > a1_count)], na.rm = TRUE )
+ sum(a1_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a1_count > a2_count)], na.rm = TRUE ),
ifi2 = (back_count/hom_ct)*100)
# the "marker_info" file we produced earlier used the filename of the genos file as the sample name (column name "ind"), but the sample names in our local R dataframes are very cleaned up (see line 504). Here I attempt to do the same using some regex in R using the standardized codes for sample naming at SFGL, but note that depending on how your fastq files are named, these exact matches may not work for you
# until we find a better solution I suggest two alternatives if this regex below breaks
# 1: if the number of high IFI samples is very low, just write the sample names out manually to a vector and use this to filter
# 2:
IFI$sample <- str_extract(IFI$ind, "[:upper:][:lower:]{2}[AJCU][RC]\\d{2}\\w{4}_\\d{4}")
IFI$adapter <- str_replace(IFI$ind, "(\\w+)[-_]([:upper:][:lower:]{2}[AJCU][RC]\\d{2}\\w{4}_\\d{4}).*", "\\1")
genos_0.4 <- genos_0.4 %>%
left_join(select(IFI, sample, adapter, ifi2), by = c("Sample" = "sample", "Adapter" = "adapter")) %>%
mutate(IFI = ifi2) %>%
select(-one_of("ifi2"))
# now filter on IFI
#print table of bad IFI samples
kable(genos_0.4 %>%
filter(IFI >2.5) %>%
select(2:7), caption = "High IFI (>2.5) samples (low confidence barcodes)")
| Sample | Raw Reads | On-Target Reads | %On-Target | %GT | IFI |
|---|---|---|---|---|---|
| OmyAC19ROGR_0142 | 642721 | 286827 | 44.63 | 91.82 | 2.596604 |
#update the dataset
genos_0.4 <- genos_0.4 %>%
filter(IFI < 2.5)
Note that the table above is blank in the example script because 0 individuals showed high contamination.
0.3 -> 0.4 Filtering Log
Filtered out:
15 individuals with <90% genotying success (i.e. greater than 10% missing data)
9 markers with > 20% missingness
0 contaminated samples (note here that all the samples with high IFI are already removed by the individual level missingness step in the example data)
0.5: Removing Paralogs
Now we manually examine allele counts for markers that may tag paralogues regions. Because our panels can contain hundreds of loci, we flag three types of markers for close scrutiny (below), but this is informal and you can also look at any marker you want using some of the scripts below.
- Missingness (loci) > 10% - examine for allele correction issues
- Markers where heterozygotes and “in-betweeners” do not follow 1:1 ratio of allele counts - Markers with high variance in ratio of allele counts at heteroyzgotes and “in-betweeners”
Let’s collect these markers, first markers with high missingness (10-20% missingness)
# Local R
#get marker names of markers with 0.1 > missingness > 0.2
miss0.1 <- missing3[missing3$missingness3 > 0.1,]
miss_mod <- miss0.1[miss0.1$missingness3 < 0.2, 2]
Next, markers with skewed allele count ratios and allele ratios with high variance. We do this by fitting a linear model between allele 1 counts and allele 2 counts and then flagging markers with a ratio of > 1.5 (3/2) and less than 2/3. We also flag markers where the fit
library(lme4)
hets <- filter(marker_info, called_geno == "HET" | is.na(called_geno))
models <- hets %>%
filter(marker %in% colnames(genos_0.4)) %>%
filter(is.na(a1_count) == FALSE & is.na(a2_count) == FALSE) %>%
group_by(marker) %>%
group_map(~ lm(a1_count ~ a2_count, data= .))
# Apply coef to each model and return a list of allele count ratios
lms <- lapply(models, coef)
ggplot()+geom_histogram(aes(x = sapply(lms,`[`,2)))+theme_classic()+ggtitle("allele ratios for all NA and HET calls")+geom_vline(aes(xintercept = 1.5), color = "red", linetype = 2)+geom_vline(aes(xintercept = (2/3)), color = "red", linetype = 2)+xlab("allele ratio (a1/a2)")+geom_vline(aes(xintercept = 1), color = "black")
#list of p-values
lms_anova <- lapply(models, summary)
# collect info about each bad model
paralog_possible <- which(abs(sapply(lms,`[`,2)) > 1.5) #bad because a positively skewed allele ratio
paralog_possible2 <- which(abs(sapply(lms,`[`,2)) < (2/3)) # bad because a negative skewed allele ratio
paralog_possible3 <- which(sapply(lms_anova, function(x) x$coefficients[,4][2])> 0.01) # bad because too much variance in allele ratio, even if mean ratio is 1
paralog_possible <- c(paralog_possible, paralog_possible2, paralog_possible3)
There are 5 markers with moderately poor genotyping success (i.e. less than 20% missingess, but greater than 10% missingness) in the example data. For this example pipeline we’re using some good markers to visualize this process instead.
Previously these types of markers were filtered on an ad hoc basis based on the extent of individuals with unclear genotypes (a lot of individuals not falling in the clear cutoffs for GT). Will attempt the same below:
Let’s explore look at one of the markers individual. We’ll create a similar plot to those generated by the GTseq figures script.
Below we show plots for one representative good marker, and two representative bad markers with separate issues. Then we’ll discuss
# R LOCAL
# this is an example code chunk don't run it in your own pipeline
marker_info <- read_csv("example_data/genotype_data/marker_info.csv")
marker_info$a1_count <- as.numeric(substr(marker_info$a1_count, 3, nchar(marker_info$a1_count)))
marker_info$a2_count <- as.numeric(substr(marker_info$a2_count, 3, nchar(marker_info$a2_count)))
ggplot(data=marker_info[marker_info$marker=="OMS00008",])+geom_point(aes(a1_count, a2_count, color = called_geno), alpha = 0.7, size =2)+theme_classic()+geom_abline(aes(slope=1, intercept=0))+geom_abline(aes(slope = 10, intercept=0), color = "green")+geom_abline(aes(slope = 0.1, intercept=0), color = "red")+geom_abline(aes(slope = 0.2, intercept=0), color = "blue")+geom_abline(aes(slope = 5, intercept=0), color = "blue")+coord_equal(ratio=1)+geom_abline(slope = -1, intercept = 10)+xlim(c(0,150))+ylim(c(0,150))+ggtitle("Good Marker")
ggplot(data=marker_info[marker_info$marker=="Omy_RAD46672-27",])+geom_point(aes(a1_count, a2_count, color = called_geno), alpha = 0.7, size =2)+theme_classic()+geom_abline(aes(slope=1, intercept=0))+geom_abline(aes(slope = 10, intercept=0), color = "green")+geom_abline(aes(slope = 0.1, intercept=0), color = "red")+geom_abline(aes(slope = 0.2, intercept=0), color = "blue")+geom_abline(aes(slope = 5, intercept=0), color = "blue")+coord_equal(ratio=1)+geom_abline(slope = -1, intercept = 10)+xlim(c(0,150))+ylim(c(0,150))+ggtitle("Bad Marker, A2 Bias")
ggplot(data=marker_info[marker_info$marker=="Omy_RAD52458-17",])+geom_point(aes(a1_count, a2_count, color = called_geno), alpha = 0.7, size =2)+theme_classic()+geom_abline(aes(slope=1, intercept=0))+geom_abline(aes(slope = 10, intercept=0), color = "green")+geom_abline(aes(slope = 0.1, intercept=0), color = "red")+geom_abline(aes(slope = 0.2, intercept=0), color = "blue")+geom_abline(aes(slope = 5, intercept=0), color = "blue")+coord_equal(ratio=1)+geom_abline(slope = -1, intercept = 10)+xlim(c(0,150))+ylim(c(0,150))+ggtitle("Bad Marker\nA1 bias in a subset of samples")
The good marker here demonstrates the qualities we’re looking for. Heterozygotes are distributed around a 1:1 ratio of alternative alleles, and homozygotes have few to no reads at the alternative allele.
The first bad marker shows a typical issue, reads among putative heterozygotes are biased towards the A2 allele. This is potentially due to a paralogue at this locus and we would filter this SNP out of the final dataset (or attempt to rescue it by changing the allele correction values if we are optimizing the panel for the first few runs).
The second bad marker shows a less common problem. There is bias towards the A1 allele, but only in a subset of individuals. One explanation for this observation is that there is polymorphism among the paralogs. In any case, toss this locus.
When actually running the pipeline for yourself use the R code chunk below. This code chunk builds similar graphs for any bad loci (missingess 10% - 20%, skewed and high variance allele count ratios) and allows you make a decision on each one.
# R Local
plots <- marker_info %>%
filter(marker %in% colnames(genos_0.4)) %>%
filter(is.na(a1_count) == FALSE & is.na(a2_count) == FALSE) %>%
group_by(marker) %>%
do(plots=ggplot(data=.)+geom_point(aes(a1_count, a2_count, color = called_geno))+theme_classic()+geom_abline(aes(slope=1, intercept=0))+geom_abline(aes(slope = 10, intercept=0), color = "green")+geom_abline(aes(slope = 0.1, intercept=0), color = "red")+geom_abline(aes(slope = 0.2, intercept=0), color = "blue")+geom_abline(aes(slope = 5, intercept=0), color = "blue")+coord_equal(ratio=1)+geom_abline(slope = -1, intercept = 10)+ggtitle(unique(.$marker)))
#plot all "bad markers"
#first add the missningness markers to the list to examine
mod_bad_plot_index <- which(plots$marker %in% miss_mod)
paralog_possible <- c(mod_bad_plot_index, paralog_possible)
# then loop through the plots by changing the index (here 33) until you have looked at all your questionable markers
plots$plots[[paralog_possible[10]]] #manually looped through these plots by changing the index for all 33 moderately bad markers, could make an lapply loop in the future, bad markers reported below
Remove the bad markers
# Local R
to_filt <- c("Omy_RAD46672-27", "Omy_RAD52458-17") # here list your bad marker names, if you have so many that this is unwieldy check out code snippet at bottom of this chunk
genos_0.5 <- genos_0.4 %>%
dplyr::select(-one_of(to_filt))
1.0 Monomorphic Markers
To generate the 1.0 dataset, we remove monomorphic markers
genos_1.0 <- genos_0.5 %>%
select_if(~ length(unique(.)) > 1)
Duplicate Samples
Some sample tissues are provided in batches of fin clips. Let’s make sure no fin clips broke apart leading to a single individual to be represented twice in the dataset. Rather than fussing with installing coancestry for windows on a unix system, estimated relatedness using an R package (related) which can implement the code from Coancestry.
To run the code in Coancestry on a windows machine, use the GUI.
We used the estimator from Lynch and Ritland 1999 #not dyadic likelihood estimator, Milligan (2003)
# Local R
# The input file needs a unique row for each indiviudal and two columns for each diploid locus
# threw out metadata and wrote to a file
# then we split all the genotype values using regex in a text editor (after converting all na values to 00)
# also convert indels / big probes (denoted with a "-" to missing, as related only wants SNPs)
# find string: \t([ATGC0XY])([ATCG0XY]) replace string: \t\1\t\2
# convert genos to numbers and removed sex marker
#convert to integer T-1 G->2 etc
just_genos <- genos_1.0[,c(2, c(8:(ncol(genos_1.0)-1)))] #note possible hardcoding here (just like missingness), if this breakes edit the columns so that it grabs only the sample name and genotype data
write_tsv(just_genos, "./example_data/genotype_data/just_genos.txt")
#now do the regex
# now run coancestry
#rmat <- coancestry("./genotype_data/just_genos.txt", dyadml = 1)
rmat2 <- coancestry("./example_data/genotype_data/just_genos.txt", lynchrd = 1)
# save the relevant info so we don't have to run this over and over and take up a ton of diskspace
rmat_to_save <- rmat2$relatedness[rmat2$relatedness$lynchrd > 0.5,]
save(rmat_to_save, file="./example_data/genotype_data/relatedness.Rdata")
Check for highly related individuals and remove any >= 0.95 from the dataset
# LOCAL R
#Check for relatedness
load(file = "./example_data/genotype_data/relatedness.Rdata")
#ggplot(rmat_to_save$relatedness)+geom_histogram(aes(x=lynchrd))+theme_classic()
rmat_to_save[which(rmat_to_save$lynchrd >=0.95), c(1:3)]
dup_inds <- rmat_to_save[which(rmat_to_save$lynchrd >= 0.95), c(1:3)]
#if you used the coancestry GUI, you can just create a vector here manually like below
#dup_inds <- c("dupicate sample name 1", "dupicate sample name 2" , etc)
genos_2.0 <- genos_1.0 %>%
filter(!(Sample %in% dup_inds$ind2.id))
Final step of genotyping is to collect some stats about the genotype dataset and reformat the genotype file into common formats for import into other programs.
Here are some summary stats and figures from your filtered dataset
# LOCAL R
ggplot(genos_2.0)+geom_density(aes(x=`On-Target Reads`))+geom_vline(aes(xintercept=median(`On-Target Reads`)), color = "red") +theme_classic()
On Target Read Distribution
#LOCAL R
ggplot(genos_2.0)+geom_density(aes(x=`%On-Target`))+geom_vline(aes(xintercept=median(`%On-Target`)), color = "red") +theme_classic()
Proportion on Target
Depths
#LOCAL R
#code to estimate depth at filtered loci
marker_info %>%
filter(marker %in% colnames(genos_2.0)) %>%
mutate(sumdepth=a1_count+a2_count) %>%
summarise(mean=mean(sumdepth, na.rm = TRUE), median=median(sumdepth, na.rm = TRUE), sd=sd(sumdepth, na.rm = TRUE))
marker_info %>%
filter(marker %in% colnames(genos_2.0)) %>%
mutate(sumdepth=a1_count+a2_count) %>%
ggplot + aes(x=sumdepth)+geom_histogram()+theme_classic()+xlab("Mean Depth Per Locus Per Individual")
Let’s get some usable file formats
Here’s adegenet’s genind object
#LOCAL R
# Convert to genind for import into adegenet
#first get a matrix to work on
#first change column to not include a dot
genos_2.1 <- genos_2.0
colnames(genos_2.1) <- gsub("\\.", "_", colnames(genos_2.1))
#convert to matrix with inds as row names
genos_2.1 <- as.matrix(genos_2.1[,c(8:362)]) #caution hardcoding, make sure you select just your marker columns
row.names(genos_2.1) <- genos_2.0$Sample
genind_1.0 <- df2genind(genos_2.1, sep ="", ploidy=2,NA.char = "0")
#add in the populations
genos_2.2 <- genos_2.0 %>%
left_join(meta_data, by=c("Sample" = "ID"))
genind_1.0@pop <- as.factor(genos_2.2$run)
Here’s a general approach using radiator package
# LOCAL R
# note didn't do this yet, but check out the command:
radiator::genomic_converter()
Finally, save your files as R objects for further analysis.
# LOCAL R
# here we save a few objects with useful info
genind_2.0 <- genind_1.0
save(genos_2.2, file ="./genotype_data/genotypes_2.2.R")
save(genind_2.0, file= "./genotype_data/genind_2.0.R")